Group 8: Zoey Ma, Erin Dougall, Jack Rong
Discerning wine tasting has long been a revered skill, and one that few people make a career of. The 'quality' of a wine is ultimately a matter of human preference, but those preferences are often influenced by physicochemical and sensory variables (Cortez et al., 2009). We want to see whether a more data-driven approach can classify wine quality. Similar models have been built before, and their rankings closely matched those of experts (Petropoulos et al., 2017).
Our model will answer the question: what quality rating will a red wine receive based on its volatile acidity and citric acid levels?
The data set we will be using is the ‘Wine Quality Data Set’ found on UCI and created by researchers at the University of Minho in Portugal. The data set focuses on red Portuguese ‘Vinho Verde’ wines. It has input variables based on physicochemical tests such as acidity, pH, alcohol level, etc. which all lead to the output of a quality score from 0-10.
Input variables (based on physicochemical tests): fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, and alcohol. Output variable (based on sensory data): quality (a score from 0 to 10).
import altair as alt
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV, cross_validate, train_test_split
from sklearn.neighbors import KNeighborsClassifier
%matplotlib inline
wine_quality_data = pd.read_csv("winequality-red.csv", sep=";")
# Replace spaces in column names with underscores
wine_quality_data.columns = wine_quality_data.columns.str.replace(' ', '_', regex=True)
wine_quality_data
| | fixed_acidity | volatile_acidity | citric_acid | residual_sugar | chlorides | free_sulfur_dioxide | total_sulfur_dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.700 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.99780 | 3.51 | 0.56 | 9.4 | 5 |
| 1 | 7.8 | 0.880 | 0.00 | 2.6 | 0.098 | 25.0 | 67.0 | 0.99680 | 3.20 | 0.68 | 9.8 | 5 |
| 2 | 7.8 | 0.760 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.99700 | 3.26 | 0.65 | 9.8 | 5 |
| 3 | 11.2 | 0.280 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.99800 | 3.16 | 0.58 | 9.8 | 6 |
| 4 | 7.4 | 0.700 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.99780 | 3.51 | 0.56 | 9.4 | 5 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1594 | 6.2 | 0.600 | 0.08 | 2.0 | 0.090 | 32.0 | 44.0 | 0.99490 | 3.45 | 0.58 | 10.5 | 5 |
| 1595 | 5.9 | 0.550 | 0.10 | 2.2 | 0.062 | 39.0 | 51.0 | 0.99512 | 3.52 | 0.76 | 11.2 | 6 |
| 1596 | 6.3 | 0.510 | 0.13 | 2.3 | 0.076 | 29.0 | 40.0 | 0.99574 | 3.42 | 0.75 | 11.0 | 6 |
| 1597 | 5.9 | 0.645 | 0.12 | 2.0 | 0.075 | 32.0 | 44.0 | 0.99547 | 3.57 | 0.71 | 10.2 | 5 |
| 1598 | 6.0 | 0.310 | 0.47 | 3.6 | 0.067 | 18.0 | 42.0 | 0.99549 | 3.39 | 0.66 | 11.0 | 6 |
1599 rows × 12 columns
wine_quality_data.isnull().sum()
# No null values present in data frame.
fixed_acidity           0
volatile_acidity        0
citric_acid             0
residual_sugar          0
chlorides               0
free_sulfur_dioxide     0
total_sulfur_dioxide    0
density                 0
pH                      0
sulphates               0
alcohol                 0
quality                 0
dtype: int64
variables = wine_quality_data.columns.values
print(variables)
['fixed_acidity' 'volatile_acidity' 'citric_acid' 'residual_sugar' 'chlorides' 'free_sulfur_dioxide' 'total_sulfur_dioxide' 'density' 'pH' 'sulphates' 'alcohol' 'quality']
# Most wines fall in the 5-6 range, which is average; 3 is the lowest quality in the data and 8 is the highest.
wine_quality_data['quality'].value_counts(normalize = True)
5    0.425891
6    0.398999
7    0.124453
4    0.033146
8    0.011257
3    0.006254
Name: quality, dtype: float64
We will use a KNN classifier to predict wine quality from the volatile_acidity and citric_acid columns, two factors commonly associated with wine quality; they are also the most useful features, as the predictor distribution plots below show. Although quality is stored as a number in the dataset, we will use a classifier rather than regression: quality is actually an ordinal rating (integers from 0-10), so we treat it as a class/category. We will split it into three labels: poor (quality 0-4), normal (quality 5-6), and excellent (quality 7-10). We will find the best k-value between 1 and 100 using cross-validation and grid search on the training set. We will then use that k-value to build a model on the entire training set and predict on the test set to determine our classifier's accuracy.
As an intermediate step, we can inspect the grid search's best_score_ and best_params_ attributes, which give the best mean cross-validation accuracy and the k-value that achieved it. This lets us easily determine the best k-value and see how the accuracy changes with different k-values.
We will visualize our results using a confusion matrix to see when and how many times we have predicted the correct label vs. the incorrect label.
We decided to create a correlation map to find the strength and direction of the relationship between variables. The coefficient ranges from -1 to 1, with a value of 0 indicating no correlation, a positive value indicating a positive correlation, and a negative value indicating a negative correlation.
# Correlation map
corr_matrix = wine_quality_data.corr()
# Create a heatmap of the correlations
sns.heatmap(corr_matrix, annot=True, cmap="YlGnBu")
plt.title('Correlation Map of Red Wine Quality')
# Display the heatmap
plt.show()
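As a quick sanity check of how these coefficients behave, a toy example (illustrative data only, not our wine measurements) shows the two extremes:

```python
import pandas as pd

# Toy data: "b" increases perfectly with "a"; "c" decreases perfectly with "a"
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],
    "c": [10, 8, 6, 4, 2],
})
corr = toy.corr()
print(corr.loc["a", "b"], corr.loc["a", "c"])  # approximately 1.0 and -1.0
```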
# Split quality into 3 labels: poor (quality 0-4), normal (quality 5-6),
# and excellent (quality 7-10)
bins = [0, 4, 6, 10]
labels = ["poor","normal","excellent"]
wine_quality_data['quality_label'] = pd.cut(wine_quality_data['quality'], bins=bins, labels=labels)
wine_quality_data.drop('quality',axis =1, inplace = True)
wine_quality_data.head()
| | fixed_acidity | volatile_acidity | citric_acid | residual_sugar | chlorides | free_sulfur_dioxide | total_sulfur_dioxide | density | pH | sulphates | alcohol | quality_label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | normal |
| 1 | 7.8 | 0.88 | 0.00 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.20 | 0.68 | 9.8 | normal |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.9970 | 3.26 | 0.65 | 9.8 | normal |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.9980 | 3.16 | 0.58 | 9.8 | normal |
| 4 | 7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | normal |
# Pairplot: to better determine the relationships between pairs of variables in the dataset,
# we create pair plots for our red wine quality data
pd.plotting.scatter_matrix(wine_quality_data, figsize=(20, 20))
# Display the pair plot
plt.show()
Then, we will create distribution plots for these two variables to compare their distributions across the different red wine quality labels.
# Distribution of volatile acidity by quality label
sns.histplot(data=wine_quality_data,x="volatile_acidity",hue="quality_label",kde=True)
plt.show()
# Distribution of citric acid by quality label
sns.histplot(data=wine_quality_data,x="citric_acid",hue="quality_label",kde=True)
plt.show()
Volatile acidity and citric acid look like important features for determining wine quality, since their distributions overlap less across quality labels than those of the other variables.
#Splitting the dataset
wine_train, wine_test = train_test_split(
wine_quality_data, train_size = 0.75
)
wine_train.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1199 entries, 20 to 1440
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype
---  ------                --------------  -----
 0   fixed_acidity         1199 non-null   float64
 1   volatile_acidity      1199 non-null   float64
 2   citric_acid           1199 non-null   float64
 3   residual_sugar        1199 non-null   float64
 4   chlorides             1199 non-null   float64
 5   free_sulfur_dioxide   1199 non-null   float64
 6   total_sulfur_dioxide  1199 non-null   float64
 7   density               1199 non-null   float64
 8   pH                    1199 non-null   float64
 9   sulphates             1199 non-null   float64
 10  alcohol               1199 non-null   float64
 11  quality_label         1199 non-null   category
dtypes: category(1), float64(11)
memory usage: 113.7 KB
wine_test.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 400 entries, 21 to 1266
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype
---  ------                --------------  -----
 0   fixed_acidity         400 non-null    float64
 1   volatile_acidity      400 non-null    float64
 2   citric_acid           400 non-null    float64
 3   residual_sugar        400 non-null    float64
 4   chlorides             400 non-null    float64
 5   free_sulfur_dioxide   400 non-null    float64
 6   total_sulfur_dioxide  400 non-null    float64
 7   density               400 non-null    float64
 8   pH                    400 non-null    float64
 9   sulphates             400 non-null    float64
 10  alcohol               400 non-null    float64
 11  quality_label         400 non-null    category
dtypes: category(1), float64(11)
memory usage: 38.0 KB
# Calculate the proportion of each quality label in the training dataset
wine_train['quality_label'].value_counts(normalize = True)
normal       0.818182
excellent    0.144287
poor         0.037531
Name: quality_label, dtype: float64
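Since roughly 82% of the training wines are labelled "normal", a useful sanity check is the majority-class baseline: a model that always predicts "normal" would already reach about 82% accuracy, so our classifier needs to beat that to be worthwhile. A minimal sketch with sklearn's DummyClassifier, using toy data that mirrors the class proportions above (the feature values are placeholders):

```python
import pandas as pd
from sklearn.dummy import DummyClassifier

# Toy labels mirroring the imbalance in our training split (82/14/4 per 100)
y = pd.Series(["normal"] * 82 + ["excellent"] * 14 + ["poor"] * 4)
# Placeholder features: DummyClassifier ignores them entirely
X = pd.DataFrame({"volatile_acidity": [0.5] * 100, "citric_acid": [0.3] * 100})

baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
print(baseline.score(X, y))  # 0.82 — accuracy of always guessing "normal"
```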
# This shows that no rows in the training dataset have missing data
wine_train.isnull().sum()
fixed_acidity           0
volatile_acidity        0
citric_acid             0
residual_sugar          0
chlorides               0
free_sulfur_dioxide     0
total_sulfur_dioxide    0
density                 0
pH                      0
sulphates               0
alcohol                 0
quality_label           0
dtype: int64
# This table shows the mean value of each predictor variable at each wine quality level
wine_vars = ['fixed_acidity','volatile_acidity','citric_acid',
             'residual_sugar','chlorides','free_sulfur_dioxide',
             'total_sulfur_dioxide','density','pH','sulphates','alcohol']
mean_sum_table = wine_train.groupby('quality_label')[wine_vars].mean()
mean_sum_table
| quality_label | fixed_acidity | volatile_acidity | citric_acid | residual_sugar | chlorides | free_sulfur_dioxide | total_sulfur_dioxide | density | pH | sulphates | alcohol |
|---|---|---|---|---|---|---|---|---|---|---|---|
| poor | 8.040000 | 0.688889 | 0.174000 | 2.782222 | 0.091778 | 11.755556 | 34.533333 | 0.996834 | 3.381556 | 0.592444 | 10.226667 |
| normal | 8.271254 | 0.536473 | 0.261539 | 2.533384 | 0.090384 | 16.633537 | 49.683996 | 0.996890 | 3.311335 | 0.648787 | 10.263439 |
| excellent | 8.875723 | 0.408353 | 0.375029 | 2.713873 | 0.076942 | 14.043353 | 34.283237 | 0.996030 | 3.290231 | 0.749942 | 11.533333 |
Training Data Visualization
# To see how the data are distributed in each column, we create distribution plots for each of the predictor variables
wine_vars = ['fixed_acidity','volatile_acidity','citric_acid',
             'residual_sugar','chlorides','free_sulfur_dioxide',
             'total_sulfur_dioxide','density','pH','sulphates','alcohol']
var_plots = []
for var in wine_vars:
    var_plot = (
        alt.Chart(wine_train)
        .mark_bar()
        .encode(
            x=alt.X(var, title=var.replace('_', ' ')),
            y=alt.Y("count()", title="count"),
            opacity=alt.value(0.5),
            color=alt.value('purple')
        )
    )
    var_plots.append(var_plot)
for var_plot in var_plots:
var_plot.display()
# Scatter plot of citric acid vs. volatile acidity, coloured by quality label
wine_concav = (
alt.Chart(wine_quality_data)
.mark_circle()
.encode(
x="volatile_acidity",
y="citric_acid",
color= alt.Color("quality_label"))
)
wine_concav
from sklearn.preprocessing import StandardScaler
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline

# Preprocess the data: standardize the two predictor columns
wine_preprocessor = make_column_transformer(
    (StandardScaler(), ["volatile_acidity", "citric_acid"]),
)
# Train the classifier with an initial choice of K = 2
knn = KNeighborsClassifier(n_neighbors=2)
X = wine_train.loc[:, ["volatile_acidity", "citric_acid"]]
y = wine_train["quality_label"]
X_test = wine_test.loc[:, ["volatile_acidity", "citric_acid"]]
y_test = wine_test["quality_label"]
knn_fit = make_pipeline(wine_preprocessor,knn).fit(X,y)
wine_test_predictions = wine_test.assign(
predicted = knn_fit.predict(wine_test.loc[:,["volatile_acidity","citric_acid"]])
)
wine_test_predictions[['quality_label','predicted']]
correct_preds = wine_test_predictions[
    wine_test_predictions['quality_label'] == wine_test_predictions['predicted']
]
correct_preds.shape[0] / wine_test_predictions.shape[0]
knn_fit
Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('standardscaler',
                                                  StandardScaler(),
                                                  ['volatile_acidity',
                                                   'citric_acid'])])),
                ('kneighborsclassifier', KNeighborsClassifier(n_neighbors=2))])
wine_acc1 = knn_fit.score(
wine_test.loc[:,["volatile_acidity","citric_acid"]],
wine_test['quality_label']
)
wine_acc1
0.7325
The accuracy with K=2 is 73.25%.
# Parameter value selection
knn = KNeighborsClassifier()
wine_tune_pipe = make_pipeline(wine_preprocessor, knn)
parameter_grid = {
    "kneighborsclassifier__n_neighbors": range(1, 100, 2),
}
wine_tune_grid = GridSearchCV(
    estimator=wine_tune_pipe,
    param_grid=parameter_grid,
    cv=5,
    n_jobs=-1
)
gs_results = wine_tune_grid.fit(
    wine_train.loc[:, ["volatile_acidity", "citric_acid"]],
    wine_train["quality_label"]
)
accuracies_grid = pd.DataFrame(gs_results.cv_results_)
accuracies_grid = accuracies_grid[["param_kneighborsclassifier__n_neighbors", "mean_test_score", "std_test_score"]
].assign(
sem_test_score = accuracies_grid["std_test_score"] / 5**(1/2)
).rename(
columns = {"param_kneighborsclassifier__n_neighbors" : "n_neighbors"}
).drop(
columns = ["std_test_score"]
)
accuracies_grid.head()
| | n_neighbors | mean_test_score | sem_test_score |
|---|---|---|---|
| 0 | 1 | 0.763152 | 0.007488 |
| 1 | 3 | 0.779801 | 0.005455 |
| 2 | 5 | 0.786475 | 0.005927 |
| 3 | 7 | 0.798992 | 0.005868 |
| 4 | 9 | 0.800649 | 0.009397 |
print('Best Accuracy: ', gs_results.best_score_)
print('Best Parameters: ', gs_results.best_params_)
Best Accuracy:  0.8223535564853556
Best Parameters:  {'kneighborsclassifier__n_neighbors': 43}
From the results above, our best model uses K = 43. To observe its accuracy, we refit with this k-value and score the model on the test set.
knn_43 = KNeighborsClassifier(n_neighbors=43)
knn_fit_43 = make_pipeline(wine_preprocessor, knn_43).fit(X, y)
wine_acc = knn_fit_43.score(
wine_test.loc[:,["volatile_acidity","citric_acid"]],
wine_test['quality_label']
)
wine_acc
0.8475
# Counts of each true label in the test set, for reading the confusion matrix below
print(y_test.value_counts())
normal       338
excellent     44
poor          18
Name: quality_label, dtype: int64
pd.crosstab(
wine_test_predictions['quality_label'],
wine_test_predictions['predicted']
)
| quality_label \ predicted | excellent | normal | poor |
|---|---|---|---|
| poor | 0 | 17 | 1 |
| normal | 75 | 263 | 0 |
| excellent | 29 | 15 | 0 |
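The crosstab above doubles as a confusion matrix, and it is worth reading row by row: the overall accuracy is driven almost entirely by the dominant "normal" class. A quick per-class recall computation (counts copied from the K = 2 predictions tabulated above) makes this concrete:

```python
import pandas as pd

# Confusion matrix from the crosstab above:
# rows are true labels, columns are predicted labels
confusion = pd.DataFrame(
    {"excellent": [0, 75, 29], "normal": [17, 263, 15], "poor": [1, 0, 0]},
    index=["poor", "normal", "excellent"],
)
# Recall per class: correct predictions for a label / true instances of that label
recall = confusion.apply(lambda row: row[row.name] / row.sum(), axis=1)
print(recall.round(3))  # poor 0.056, normal 0.778, excellent 0.659
```

The "poor" class is almost never recovered, which follows from its scarcity in the training data; this is worth keeping in mind alongside the headline accuracy numbers.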
Our expected outcome from this strategy is a model that predicts the quality of a Portuguese "Vinho Verde" red wine as closely to the judgements of wine experts as possible.
Using a data mining approach to classify wine quality could be of huge significance to the wine industry. When new wines are certified, many countries require by law that the sensory analysis be done by human testers. However, every tester has their own unique experience, so their analysis is inherently subjective, whereas a model-based classification remains objective. Some researchers suggest these data-driven approaches could make wine evaluation more efficient; for example, an expert would need to repeat their evaluation only if there is a significant difference between their classification and the model's (Cortez et al., 2009). Looking to the future, could classification models like ours help new winemakers legitimize their products without the need for expensive evaluations?
Cortez, P., Cerdeira, A., Almeida, F., Matos, T., Reis, J. (2009). Modelling wine preference by data mining from physicochemical properties. Decision Support Systems, 47(4), 547-553. https://doi.org/10.1016/j.dss.2009.05.016
Petropoulos, S., Karavas, C. S., Balafoutis, A. T., Paraskevopoulos, I., Kallithraka, S., Kotseridis, Y. (2017). Fuzzy logic tool for wine quality classification. Computers and Electronics in Agriculture, 142(Part B), 552-562. https://doi.org/10.1016/j.compag.2017.11.015